overview
Full image here of workflow is here: overview
scRNA seq is a very variable process and each dataset has unique QC problems. Though our simplified approach often works, we also need many more tools in our toolbox to handle more complex datasets. Often it is a case of trial and error to see what approaches work best for your data.
Cell Ranger is a suite of tools for single cell processing and analysis available from 10X Genomics.
In this session we will demonstrate the use of Cell Ranger Count tool to process our generated fastQ and create the required files. ]
]
Cell Ranger is available from the 10x genomics website.
Also available are pre-baked references for Human and Mouse genomes (GRCh37/38 and GRCm37)
] ]
Download the software
wget -O cellranger-7.2.0.tar.gz "https://cf.10xgenomics.com/releases/cell-exp/cellranger-7.2.0.tar.gz?Expires=1701688001&Key-Pair-Id=APKAI7S6A5RYOXBWRPDA&Signature=Puwpsqsf~wMQz5e~PwTvM2DRQO1XdJ~9zeLCWqX6tVbOx~dnf24hP1bwlmNhybr3SZUQ8C12ywcICMH6Au02wxiCRm1uuTxZ0Uvq8g~s8L8s6XFyhepdi6Qjq8dzXNGoxswg3hModjKWVptTWq-MTHBDZv~yTFB7QAM9lzHHXo6SPWg8Fnx30ngmtGC5tDReVOiJ3DY0hsFvZtG3HaQ-HEbnzEH3lre-0rpWMBlsQu-vZ4RnKE0o3Xv6pQsb6261M19nHcpCsGhDCkFjDDbradx~SNw5rpY-HMxM4SnRuaOOI0rYyDNn7xdTat3eFj7rlgATXRaYx7SYNqDYKSrNWw__"
Download reference for Human genome (GRCh38)
wget -O https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz
Unpack software and references.
tar -xzvf cellranger-7.2.0.tar.gz
tar -xzvf refdata-gex-GRCh38-2020-A.tar.gz
export PATH=/PATH_TO_CELLRANGER_DIRECTORY/cellranger-7.1.0:$PATH
Now we have the downloaded Cell Ranger software and required pre-build reference for Human (GRCh38) we can start the generation of count data from scRNA-seq/snRNA-seq fastQ data.
Typically FastQ files for your scRNA run will have been generated using the Cell Ranger mkfastq toolset to produce a directory a FastQ files.
We can now use CellRanger count command with our reference and fastQ files to generate our count matrix and associated files.
If you are analyzing single nuclei RNA-seq data remember to set the –include-introns flag.
cellranger count --id=my_run_name \
--fastqs=PATH_TO_FASTQ_DIRECTORY \
--transcriptome=/PATH_TO_CELLRANGER_DIRECTORY/refdata-gex-GRCh38-2020-A
If you are working with a genome which is not Human and/or mouse you will need to find another source for your Cell Ranger reference.
To create your own references you will need two additional files.
Used to genome annotation.
Stores position, feature (exon) and meta-feature (transcript/gene) information.
Importantly for Cell Ranger Count, only features labelled as exon (column 3) will be considered for counting signal in genes
Many genomes label mitochondrial genes with CDS and not exon so these must be updated
Now we have the gene models in the GTF format we can use the Cell Ranger mkgtf tools to validate our GTF and remove any unwanted annotation types using the attribute flag.
Below is an example of how 10x generated the GTF for the Human reference.
cellranger mkgtf Homo_sapiens.GRCh38.ensembl.gtf \
Homo_sapiens.GRCh38.ensembl.filtered.gtf \
--attribute=gene_biotype:protein_coding \
--attribute=gene_biotype:lncRNA \
--attribute=gene_biotype:antisense \
--attribute=gene_biotype:IG_LV_gene \
--attribute=gene_biotype:IG_V_gene \
--attribute=gene_biotype:IG_V_pseudogene \
--attribute=gene_biotype:IG_D_gene \
--attribute=gene_biotype:IG_J_gene \
--attribute=gene_biotype:IG_J_pseudogene \
--attribute=gene_biotype:IG_C_gene \
--attribute=gene_biotype:IG_C_pseudogene \
--attribute=gene_biotype:TR_V_gene \
--attribute=gene_biotype:TR_V_pseudogene \
--attribute=gene_biotype:TR_D_gene \
--attribute=gene_biotype:TR_J_gene \
--attribute=gene_biotype:TR_J_pseudogene \
--attribute=gene_biotype:TR_C_gene
Following filtering of your GTF to the required biotypes, we can use the Cell Ranger mkref tool to finally create our custom reference.
cellranger mkref --genome=custom_reference \
--fasta=custom_reference.fa \
--genes=custom_reference_filtered.gtf
Having completed the Cell Ranger count step, the user will have created a folder named as set by the –id flag for count command.
Within this folder will be the outs/ directory containing all the outputs generated from Cell Ranger count.
The count matrices to be used for further analysis are stored in both MEX and HDF5 formats within the output directories.
The filtered matrix only contains detected, cell-associated barcodes whereas the raw contains all barcodes (background and cell-associated).
MEX format - filtered_feature_bc_matrix - raw_feature_bc_matrix
HDF5 format - filtered_feature_bc_matrix.h5 - raw_feature_bc_matrix.h5
The outs directory also contains a BAM file of alignments for all barcodes against the reference (possorted_genome_bam.bam) as well as an associated BAI index file (possorted_genome_bam.bam.bai).
This BAM file is sometimes used in downstream analysis such as scSplit/Velocyto as well as for the generation of signal graphs such as bigWigs. ]
]
Cell Ranger also outputs files for visualisation within its own cloupe browser - cloupe.cloupe.
This allows for the visualisation of scRNA-seq/snRNA-seq as a t-sne/umap with the ability to overlay metrics of QC and gene expression onto the cells in real time
Cell Ranger will also output summaries of useful metrics as a text file (metrics_summary.csv) and as a intuitive web-page.
Metrics include
There are many potential issues which can arise in scRNA-seq/snRNA-seq data including -
]
Assessment of the overall quality of a scRNA-seq/snRNA-seq experiment after Cell Ranger can give our first insght into any issues we might face.
The web summary html file contains an interactive report describing the most essential QC for your single cell experiment as well as initial clustering and dimension reduction for your data.
The web summary also contains useful information on the input files and the versions used in this analysis for later reproducibility. ]
]
The first thing we can review is the Sample information panel.
]
]
The Sequencing panel highlights information on the quality of the illumina sequencing.
]
]
Key Metrics:
* Q30 Bases in RNA read > 65% (usually > 80%)
+ Reflect to Sequencing quality
+ Need to check with sequencing service supplier
* Sequencing saturation > 40% (usually range 20% ~ 80%)
+ Reflects the complexity of libraries
+ shall consider, but not necessarily, re-construct library while it too low
The Mapping panel highlights information on the mapping of reads to the reference genome and transcriptome.
]
]
Key Metrics:
* Mapped to Genome > 60% (usually range 50% ~ 90%)
+ Mapping rate to reference genome
+ Check reference genome version if too low
* Reads Mapped Confidently to Transcriptome > 30% (usually > 60%)
+ Reflection of annotation to transcriptome
+ Check annotation if too low
The Cells panel highlights some of the most important information in the report, the total number of cells captured and the distribution of counts across cells and genes.
]
]
Key Metrics:
The Cell panel also includes and interactive knee plot.
The knee plot shows:-
On the x-axis, the barcodes ordered by the most frequent on the left to the least frequent on the right
On the y-axis, the frequency of each ordered barcode.
Highlighted in blue are the barcodes marked as associated to cells.
]
]
It is apparent that barcodes labelled blue (cell-associated barcodes) do not have a cut-off based on UMI count.
In the latest version of Cell Ranger a two step process is used to define cell-associated barcodes based on the EmptyDrops method (Lun et al.,2019).
If required, a –force-cells flag can be used with cellranger count to identify a set number of cell-associated barcodes.
]
]
The Knee plot also acts a good QC tools to investigate differing types of single cell failure.
Whereas our previous knee plot represented a good sample, differing knee plot patterns can be indicative of specific problems with the single cell protocol.
In this example we see no specific cliff and knee suggesting a failure in the integration of oil, beads and samples (wetting failure) or a compromised sample.
]
]
If there is a clog in the machine we may see a knee plot where the overall number of samples is low.
]
]
There may be occasions where we see two sets of cliff-and-knees in our knee plot.
This could be indicative of a heterogenous sample where we have two populations of cells with differing overall RNA levels.
Knee plots should be interpreted in the context of the biology under investigation.
]
]
The web-summary also contains an analysis page where default dimension reduction, clustering and differential expressions between clusters has been performed.
Additionally the analysis page contains information on sequencing saturation and gene per cell vs reads per cell.
]
]
The t-sne plot shows the distribution and similarity within your data.
]
]
The sequence saturation and Median genes per cell plots show these calculations (as show on summary page) over successive downsampling of the data.
By reviewing the curve of the down sampled metrics we can assess whether we are approaching saturation for either of these metrics.
]
]
Often empty droplets can still make their way past Cell Ranger, or we might see an issue in the Cell Ranger web report itself. If this happens we might want to do custom filtering outside of Cell Ranger.
One cause of these issues could be ambient RNAs. Ambient RNAs from lysed cells make empty droplets seem like they contain a cell or contaminate droplets and.
Lets look at the PBMC 10k data from 10X Genomics as an example to deal with these issues. This time we will use the raw matrix. This way we don’t use the filtering that Cell Ranger does.
CellBender is a command line toolkit developed to eliminate technical artifacts from scRNA-Seq data such as empty droplets and ambient RNAs link.
In the current version, it contains remove-background module, which is applied to: * Detect empty droplets and ambient RNAs from raw count matrix in CellRanger output. * Remove any droplets considered to be empty, correct the interference of ambient RNAs.
We take the raw matrices from Cell Ranger and run CellBender. The cuda argument will speed up the function, but rquires your run it on a GPU.
input_h5=the_raw_matrix_in_h5_format_from_cellranger #essential
output_h5=assign_the_h5_file_path_for_the_cellbender_corrected_matrix # essential
fpr=threshold_of_FALSE_POSITIVE_RATE # default 0.01
epochs=number_of_epochs_to_train # default 150
num_train=number_of_times_to_attempt_to_train_the_model # default 1. would speed up while setting greater
cellbender remove-background --input $input_h5 --output $output_h5 --fpr $fpr --epochs $epochs --num-training-tries $num_train --cudaThe h5 file produced by CellBender is currently incompatible with Seurat. We can use the following function to convert the h5 file to a Seurat object.
CellBender has a guide how to convert the h5 file to a Seurat object here, using PyTables. This is another command line tool.
ptrepack --complevel 5 celbender_filtered.h5:/matrix celbender_filtered_forseurat.h5:/matrixWe have some processed results here from CellBender. We can compare this to our filtered matrix from Cell Ranger.
library(Seurat)
cbFilt_mtx <- Read10X_h5("data/cbFilt_PBMCv3_20230324_filtered.h5")We can also need to read back in our filtered PBMC data from Cell Ranger.
download.file("https://cf.10xgenomics.com/samples/cell-exp/6.1.0/10k_PBMC_3p_nextgem_Chromium_Controller/10k_PBMC_3p_nextgem_Chromium_Controller_filtered_feature_bc_matrix.tar.gz",
"10k_PBMC_3p_nextgem_Chromium_Controller_filtered_feature_bc_matrix.tar.gz")
untar("10k_PBMC_3p_nextgem_Chromium_Controller_filtered_feature_bc_matrix.tar.gz")
filt_mtx <- Seurat::Read10X("filtered_feature_bc_matrix/")dim(cbFilt_mtx)## [1] 36601 12244
dim(filt_mtx)## [1] 36601 11485
We are going to try out a few tools and approaches. We want to wrap some of our analysis steps into a function to simplify rerunning things.
Normalization: * Log normalization with scale factor = 10,000 * Find Variable features with vst, select top 2000 variable features
data_proc <- function(seu) {
seu <- NormalizeData(seu, normalization.method = "LogNormalize", scale.factor = 10000)
seu <- FindVariableFeatures(seu, select.method = "vst", nfeatures = 2000)
return(seu)
}Make clusters: * Scale data with ScaleData() * Principle Component Analysis by using RunPCA() with npcs=30 PCs * Make non-linear dimensional reduction in UMAP by using RunUMAP() with dims=1:10 * Estimate Neighbors by using FindNeighbors() with dims=1:10 * Identify clusters with FindClusters() by using resolution=0.5
quick_clust <- function(seu) {
set.seed(42)
seu <- ScaleData(seu, verbose = FALSE)
seu <- RunPCA(seu, npcs = 30, verbose = FALSE)
seu <- RunUMAP(seu, reduction = "pca", dims = 1:10, verbose = FALSE)
seu <- FindNeighbors(seu, reduction = "pca", dims = 1:10, verbose = FALSE)
seu <- FindClusters(seu, resolution = 0.5, verbose = FALSE)
return(seu)
}message("processing matrix from CellRanger")
seu <- CreateSeuratObject(filt_mtx)
seu <- data_proc(seu)
seu <- ScaleData(seu)
seu <- quick_clust(seu)
seu_filt <- seu
message("processing matrix from CellBender")
seu <- CreateSeuratObject(cbFilt_mtx)
seu <- data_proc(seu)
seu <- ScaleData(seu)
seu <- quick_clust(seu)
seu_cbFilt <- seuDimPlot(seu_filt, group.by = "seurat_clusters", pt.size = 0.1, label = TRUE) + NoLegend()DimPlot(seu_cbFilt, group.by = "seurat_clusters", pt.size = 0.1, label = TRUE) +
NoLegend()mark_gene <- c("CD3E", "CCR7")
VlnPlot(seu_filt, features = mark_gene, group.by = "seurat_clusters", pt.size = 0)mark_gene <- c("CD3E", "CCR7")
VlnPlot(seu_cbFilt, features = mark_gene, group.by = "seurat_clusters", pt.size = 0)rm(cbFilt_mtx, filt_mtx, seu_filt, seu_cbFilt)
gc()## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 14768688 788.8 22661588 1210.3 22661588 1210.3
## Vcells 139168895 1061.8 627910701 4790.6 695007868 5302.5
We need to load in the dataset, but it is important not to set any filters yet. This means we will be using the raw matrix.
library(DropletUtils)
download.file("https://cf.10xgenomics.com/samples/cell-exp/6.1.0/10k_PBMC_3p_nextgem_Chromium_Controller/10k_PBMC_3p_nextgem_Chromium_Controller_raw_feature_bc_matrix.tar.gz",
"10k_PBMC_3p_nextgem_Chromium_Controller_raw_feature_bc_matrix.tar.gz")
untar("10k_PBMC_3p_nextgem_Chromium_Controller_raw_feature_bc_matrix.tar.gz")
raw_mtx <- Seurat::Read10X("raw_feature_bc_matrix")
dim(raw_mtx)## [1] 36601 2099284
Calculate the counts for each cell barcode and their rank.
bcrank <- barcodeRanks(raw_mtx)
head(bcrank, 2)## DataFrame with 2 rows and 3 columns
## rank total fitted
## <numeric> <integer> <numeric>
## AAACCCAAGAAACCAT-1 1752547 0 NA
## AAACCCAAGAAACCCA-1 914872 1 NA
To draw a custom knee plot we need to remove duplicated ranks and grab the knee and inflection points.
uniq <- !duplicated(bcrank$rank)
bcrank <- bcrank[uniq, ]
knee <- metadata(bcrank)$knee
message(paste0("knee point: ", knee))
inflection <- metadata(bcrank)$inflection
message(paste0("inflection point: ", inflection))We can now draw our knee plot using ggplot().
ggplot(as.data.frame(bcrank), aes(x = rank, y = total)) + geom_point() + geom_hline(yintercept = knee,
linetype = "dashed", color = "blue") + geom_hline(yintercept = inflection, linetype = "dashed",
color = "green") + scale_x_continuous(trans = "log10") + scale_y_continuous(trans = "log10") +
labs(x = "cell barcode ranked by counts", y = "UMI counts of each cell barcode") +
theme_classic()We can now draw our knee plot using ggplot(). The knee is labelled in blue. The inflection is labelled in green.
We can detect empty droplets with DropletUtils::emptyDrops(). - The p-value for each cell is performed by Monte Carlo simulation basing on the deviation of a given cell to the ambient RNA pool.
e.out <- emptyDrops(raw_mtx)
e.out <- e.out[order(e.out$FDR), ]
head(e.out)## DataFrame with 6 rows and 5 columns
## Total LogProb PValue Limited FDR
## <integer> <numeric> <numeric> <logical> <numeric>
## AAACCCAAGGTAGTCG-1 6088 -7854.52 9.999e-05 TRUE 0
## AAACCCACAATCCAGT-1 14628 -17829.20 9.999e-05 TRUE 0
## AAACCCACACCGTCTT-1 7642 -9887.60 9.999e-05 TRUE 0
## AAACCCACATCTCATT-1 9834 -13742.83 9.999e-05 TRUE 0
## AAACCCACATTGACTG-1 7409 -10416.66 9.999e-05 TRUE 0
## AAACCCAGTCACTGAT-1 5834 -8026.81 9.999e-05 TRUE 0
The distribution of UMI counts per cell is often a bimodal distribution. The high-UMI group are likely droplets with a cell. In contrast, the droplets with low UMI are more likely empty droplets containing ambient RNAs.
ggplot(as.data.frame(e.out), aes(x = Total)) + geom_histogram() + geom_vline(xintercept = knee,
color = "red", linetype = "dashed") + labs(x = "UMI counts per cell", y = "Frequency") +
scale_y_continuous(trans = "log10") + scale_x_continuous(trans = "log10") + theme_classic()The distribution of UMI counts per cell is often a bimodal distribution. The high-UMI group are likely droplets with a cell. In contrast, the droplets with low UMI are more likely empty droplets containing ambient RNAs.
In the DropletUtils manual, the FDR threshold is set to 0.001. In most cases. We can filter based on this cutoff.
table(e.out$FDR < 0.001)##
## FALSE TRUE
## 605 11580
cell_sel <- rownames(e.out)[e.out$FDR < 0.001]
filt_mtx <- raw_mtx[, colnames(raw_mtx) %in% cell_sel]The cell-free RNA molecules randomly spread in the solution. It can be enclosed into droplets during the steps of library preparation. So, it would be falsely increase the UMI counts of genes.
Ambient RNAs present two signature
According to these two properties, we could estimate ambient RNA contamination rates and correct it. Here we will demonstrate how to correct ambient RNA contamination with DropletUtils and SoupX, respectively.
We can see here an example of very high ambient RNA. This marker gene is supposed to be very specific. The UMAP also does not look great.
The same dataset has very high MT contamination. This is a good indicator of ambient RNA contamination.
Lets start by estimating our ambient RNA contamination. This is from DropletUtils.
amb <- metadata(e.out)$ambient[, 1]
head(amb)## AL627309.1 AL627309.3 AL627309.5 AL627309.4 AL669831.2 LINC01409
## 8.369426e-07 9.859354e-08 5.635719e-06 9.859354e-08 9.859354e-08 6.511074e-06
We can read back in our filtered PBMC data.
download.file("https://cf.10xgenomics.com/samples/cell-exp/6.1.0/10k_PBMC_3p_nextgem_Chromium_Controller/10k_PBMC_3p_nextgem_Chromium_Controller_filtered_feature_bc_matrix.tar.gz",
"10k_PBMC_3p_nextgem_Chromium_Controller_filtered_feature_bc_matrix.tar.gz")
untar("10k_PBMC_3p_nextgem_Chromium_Controller_filtered_feature_bc_matrix.tar.gz")
filt_mtx <- Seurat::Read10X("filtered_feature_bc_matrix/")Then filter out empty droplets.
filt_mtx_drop <- filt_mtx[rownames(filt_mtx) %in% names(amb), ]
seu <- CreateSeuratObject(filt_mtx_drop)seu <- data_proc(seu)
seu <- ScaleData(seu)
seu <- quick_clust(seu)
sce <- as.SingleCellExperiment(seu, assay = "RNA")plotUMAP(sce, colour_by = "seurat_clusters")Lets put our uncorrected, filtered dataset in a list.
seu_list <- list()
seu_list[["woCorr"]] <- seuWe can use the to removeAmbience() function to remove ambient RNAs.
out <- removeAmbience(counts(sce), ambient = amb, groups = sce$seurat_clusters)
rownames(out) <- rownames(sce)
colnames(out) <- colnames(sce)Now that we have run a correction, we want to reprocess and cluster our data.
seu_list[["withCorr"]] <- CreateSeuratObject(out)
seu_list[["withCorr"]] <- data_proc(seu_list[["withCorr"]])
seu_list[["withCorr"]] <- ScaleData(seu_list[["withCorr"]])
seu_list[["withCorr"]] <- quick_clust(seu_list[["withCorr"]])Lets compare to our marker genes.
mark_gene <- c("CCR7", "CD8A", "MS4A1", "CD14", "FCGR3A", "FCER1A", "GNLY", "NKG7",
"PPBP")
mark_gene## [1] "CCR7" "CD8A" "MS4A1" "CD14" "FCGR3A" "FCER1A" "GNLY" "NKG7"
## [9] "PPBP"
First lets look at the UMAP without ambient RNA correction.
DimPlot(seu_list$woCorr, group.by = "seurat_clusters", pt.size = 0.1, label = TRUE) +
NoLegend()Lets now compare to UMAP wit ambient RNA correction.
DimPlot(seu_list$withCorr, group.by = "seurat_clusters", pt.size = 0.1, label = TRUE) +
NoLegend()Lets now look at a violin plot in the uncorrected data.
VlnPlot(seu_list$woCorr, features = mark_gene, group.by = "seurat_clusters", pt.size = 0)And then again in the corrected.
VlnPlot(seu_list$withCorr, features = mark_gene, group.by = "seurat_clusters", pt.size = 0)SoupX is an alternative tool for droplet processing, developed for ambient RNA detection and correction. link. It includes the following steps. - Estimate ambient RNA expression in the empty droplets. - Estimate contamination rate by using the ambient RNA expression levels across clusters. - Correct ambient RNA contamination.
We need our filtered and unfiltered matrix. Plus our processed filtered Seurat object.
clust <- setNames(seu$seurat_clusters, Cells(seu))
seu_filt <- seuImport our matrices into the SoupX channel object.
library(SoupX)
soupOBJ <- SoupChannel(raw_mtx, filt_mtx)
soupOBJ <- setClusters(soupOBJ, clust)The autoEstCont() can be used to detect contamination. We can then export the correct counts with adjustCount().
soupOBJ <- autoEstCont(soupOBJ)
autoCor_mtx <- adjustCounts(soupOBJ)We will now put our corrected matrix into a Seurat object.
seu <- CreateSeuratObject(autoCor_mtx)
seu <- data_proc(seu)
seu <- ScaleData(seu)
seu <- quick_clust(seu)
seu_autoCorr <- seuWe can estimate contamination using known marker genes.
mark_list <- list(`CD4 T-cell` = c("IL7R", "CCR7", "S100A4"), `CD8 T-cell` = c("CD8A"),
`B-Cell` = c("MS4A1"), Monocyte = c("CD14", "FCGR3A"), DC = c("FCER1A"), NK = c("NKG7",
"GNLY"), Platelet = c("PPBP"))
use_toEst <- estimateNonExpressingCells(soupOBJ, nonExpressedGeneList = mark_list)
soupOBJ <- calculateContaminationFraction(soupOBJ, mark_list, useToEst = use_toEst)
rho <- unique(soupOBJ$metaData$rho)
rho## [1] 0.0193281
soupOBJ <- setContaminationFraction(soupOBJ, rho, forceAccept = TRUE)
estCor_mtx <- adjustCounts(soupOBJ)seu <- CreateSeuratObject(estCor_mtx)
seu <- data_proc(seu)
seu <- ScaleData(seu)
seu <- quick_clust(seu)
seu_estCorr <- seumark_gene <- c("CCR7", "CD8A", "MS4A1", "CD14", "FCGR3A", "FCER1A", "GNLY", "NKG7",
"PPBP")
mark_gene## [1] "CCR7" "CD8A" "MS4A1" "CD14" "FCGR3A" "FCER1A" "GNLY" "NKG7"
## [9] "PPBP"
DimPlot(seu_filt, group.by = "seurat_clusters", pt.size = 0.1, label = TRUE) + NoLegend()VlnPlot(seu_filt, features = mark_gene, group.by = "seurat_clusters", pt.size = 0)DimPlot(seu_autoCorr, group.by = "seurat_clusters", pt.size = 0.1, label = TRUE) +
NoLegend()VlnPlot(seu_autoCorr, features = mark_gene, group.by = "seurat_clusters", pt.size = 0)DimPlot(seu_estCorr, group.by = "seurat_clusters", pt.size = 0.1, label = TRUE) +
NoLegend()VlnPlot(seu_estCorr, features = mark_gene, group.by = "seurat_clusters", pt.size = 0)In our simplified workflow we used a log normalization approach. Recently an alternative approach has been gaining popularity: SCTransform. Lets compare this back to our log normalization.
seu_obj <- SCTransform(seu_obj, variable.features.n = 3000)
seu_obj## An object of class Seurat
## 56951 features across 11485 samples within 2 assays
## Active assay: SCT (20350 features, 3000 variable features)
## 3 layers present: counts, data, scale.data
## 1 other assay present: RNA
The normalized data is stored in assay SCT. We can see this is now the active assay. The log results are stored in the RNA assay. We can control which is the active assay with the DefaultAssay() function.
DefaultAssay(seu_obj) <- "RNA"
DefaultAssay(seu_obj) <- "SCT"## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 14776480 789.2 22661588 1210.3 22661588 1210.3
## Vcells 351423300 2681.2 578799537 4415.9 895229794 6830.1
We can extract out the results of our normalized and scaled data with the GetAssayData() function.
log_mat <- GetAssayData(seu_obj, assay = "RNA", slot = "data")
log_mat <- as.matrix(log_mat)
log_avgExp <- rowMeans(log_mat)
sct_mat <- GetAssayData(seu_obj, assay = "SCT", slot = "data")
sct_mat <- as.matrix(sct_mat)
sct_avgExp <- rowMeans(sct_mat)matched_names <- intersect(names(log_avgExp), names(sct_avgExp))
log_avgExp <- log_avgExp[matched_names]We can test the concordance with Spearman’s correlation.
dat <- data.frame(logNorm = log_avgExp, SCT = sct_avgExp)
cor_val <- cor.test(log_avgExp, sct_avgExp, method = "spearman")
ggplot(dat, aes(x = logNorm, y = SCT)) + geom_point() + geom_smooth() + labs(x = "Log_Normalization",
y = "SCTransform", subtitle = paste0("rho=", round(cor_val$estimate, 3), "; p-value=",
cor_val$p.value[1])) + theme_classic()We can see there is very high similarity between the results.
We will discuss a number of approaches and methods we may need to use to merge datasets.
We will be using the IFNB-Stimulated and Control PBMCs from Seurat Data. The easiest way to get this data is from their custom package SeuratData which is hosted on GitHub.
We will quickly show you how to get this data, but it isn’t necessary to run these steps.
remotes::install_github("satijalab/seurat-data")library(Seurat)
library(SeuratData)
InstallData("ifnb")
LoadData("ifnb")
head(ifnb, 2)## orig.ident nCount_RNA nFeature_RNA stim seurat_annotations
## AAACATACATTTCC.1 IMMUNE_CTRL 3017 877 CTRL CD14 Mono
## AAACATACCAGAAA.1 IMMUNE_CTRL 2481 713 CTRL CD14 Mono
This dataset is already loaded in as a Seurat object, and it is already merged. So we need to split it, so we can merge it ourselves! The groups are in the “stim” column of the metadata.
table(ifnb$stim)
ifnb_list <- Seurat::SplitObject(ifnb, split.by = "stim")
ifnb_listAt this point you could save your data object for later using the save() or saveRDS() function.
save("ifnb_list", file = "data/seuOBJ_IFNB_splitByStim.RData")We have the ifnb_list saved in an RData object, which you can load in.
load("data/seuOBJ_IFNB_splitByStim.RData")We can use merge() function to integrate our two data sets, as they are. We need to provide Group information and a name for the project as arguments.
ifnb_merge <- merge(ifnb_list$CTRL, ifnb_list$STIM, add.cell.ids = c("CTRL", "STIM"),
project = "ifnb_seuMerge")
head(ifnb_merge, 2)## orig.ident nCount_RNA nFeature_RNA stim
## CTRL_AAACATACATTTCC.1 IMMUNE_CTRL 3017 877 CTRL
## CTRL_AAACATACCAGAAA.1 IMMUNE_CTRL 2481 713 CTRL
## seurat_annotations
## CTRL_AAACATACATTTCC.1 CD14 Mono
## CTRL_AAACATACCAGAAA.1 CD14 Mono
We can use our processing and clustering functions to analyse our merged dataset.
ifnb_merge <- data_proc(ifnb_merge)
ifnb_merge <- quick_clust(ifnb_merge)Our UMAP shows our cells are distinct, depending on the condition.
DimPlot(ifnb_merge, group.by = "stim", pt.size = 0.2)A given cell type should often be clustered together. This pattern indicates the opposite. Different cell types are split into distinct groups depending on the sample.
DimPlot(ifnb_merge, group.by = "seurat_annotations", pt.size = 0.2, split.by = "stim")Reciprocal PCA minimizes the batch effects while merging different data sets.
This workflow is modified from canonical correlation analysis (CCA), which is widely used to merge batches.
First, we prepare the data for integration. We will normalize the data sets separately. Than, we need to identify features for integration. This is similar to the VariableFeatures function we ran on a single dataset. Lastly we run scaling and PCA, using these features.
ifnb_list_rpca <- lapply(ifnb_list, data_proc)
feats <- SelectIntegrationFeatures(ifnb_list_rpca)
ifnb_list <- lapply(ifnb_list, function(seu, feats) {
seu <- ScaleData(seu, features = feats, verbose = FALSE)
seu <- RunPCA(seu, features = feats, verbose = FALSE)
return(seu)
}, feats)We can then identify anchors. These are the features through which we will integrate our data. Once we have these features, we can then integrate our data sets together.
anchors <- FindIntegrationAnchors(ifnb_list, anchor.features = feats, reduction = "rpca")
ifnb_merge <- IntegrateData(anchorset = anchors)
ifnb_merge## An object of class Seurat
## 16053 features across 13999 samples within 2 assays
## Active assay: integrated (2000 features, 2000 variable features)
## 1 layer present: data
## 1 other assay present: RNA
To evaluate how well the merge has worked we must check the clustering. Again we must scale, and then use our quick_clust function.
We can now see that our two data sets overlay with each other.
ifnb_merge <- ScaleData(ifnb_merge)
ifnb_merge <- quick_clust(ifnb_merge)
DimPlot(ifnb_merge, group.by = "stim", pt.size = 0.2)We can now see that our two data sets overlay with each other.
We can check the numbers in each cluster. Broadly, there are similar numbers per cluster now.
tbl <- table(ifnb_merge$stim, ifnb_merge$seurat_clusters)
barplot(tbl, beside = T, main = "Cell numbers in each cluster of each group")We can also check the cell types. Using UMAPs we can split and compare across our conditions and cell types.
DimPlot(ifnb_merge, group.by = "seurat_annotations", split.by = "stim", pt.size = 0.2)Using heatmaps we can also check how specific each cluster is to each cell type.
library(pheatmap)
tbl <- table(ifnb_merge$seurat_clusters, ifnb_merge$seurat_annotations)
pheatmap(tbl, scale = "column")Mutual Nearest Neighbors (MNN) approach uses paired cells instead of anchor genes to find the difference between batches.
The MNN correction was published by Marioni et al, Nature (2018). link
First we have to convert Seurat object to SingleCellExperiment object.
sce_list <- lapply(ifnb_list, function(seu) {
sce <- as.SingleCellExperiment(seu, assay = "RNA")
rowData(sce)$SYMBOL <- rownames(sce)
return(sce)
})
sce_list## $CTRL
## class: SingleCellExperiment
## dim: 14053 6548
## metadata(0):
## assays(2): counts logcounts
## rownames(14053): AL627309.1 RP11-206L10.2 ... AP001062.7 LRRC3DN
## rowData names(1): SYMBOL
## colnames(6548): AAACATACATTTCC.1 AAACATACCAGAAA.1 ... TTTGCATGCTTCGC.1
## TTTGCATGGTCCTC.1
## colData names(6): orig.ident nCount_RNA ... seurat_annotations ident
## reducedDimNames(1): PCA
## mainExpName: RNA
## altExpNames(0):
##
## $STIM
## class: SingleCellExperiment
## dim: 14053 7451
## metadata(0):
## assays(2): counts logcounts
## rownames(14053): AL627309.1 RP11-206L10.2 ... AP001062.7 LRRC3DN
## rowData names(1): SYMBOL
## colnames(7451): AAACATACCAAGCT.1 AAACATACCCCTAC.1 ... TTTGCATGCTAAGC.1
## TTTGCATGGGACGA.1
## colData names(6): orig.ident nCount_RNA ... seurat_annotations ident
## reducedDimNames(1): PCA
## mainExpName: RNA
## altExpNames(0):
As with RPCA we need to model and identify highly variable genes. For this we will use the scran functions modelGeneVar() and getTopHVGs(). We simply provide our SingleCellExperiment object.
library(scran)
dec_list <- lapply(sce_list, modelGeneVar)
hvgc_list <- lapply(sce_list, getTopHVGs, prop = 0.1)Next we will find the features that are shared between samples. We must first define the shared “universe” of genes between our samples, and subset our variable genes to these. We can then combine the the variance to find variable features in both data sets.
universe <- intersect(rownames(sce_list$CTRL), rownames(sce_list$STIM))
sce_list <- lapply(sce_list, function(sce, universe) {
sce <- sce[universe, ]
return(sce)
}, universe)
dec_list <- lapply(dec_list, function(dec, universe) {
dec <- dec[universe, ]
return(dec)
}, universe)
combined_dec <- combineVar(dec_list$CTRL, dec_list$STIM)
chosen_hvgs <- combined_dec$bio > 0We will use the batchelor package to run MNN with the fastMNN() function. * d: number of principles evaluated * k: number of nearest neighbors to consider * subset.row: subset genes. Here, we are using the top variable features
library(batchelor)
mnn_res <- fastMNN(CTRL = sce_list$CTRL, STIM = sce_list$STIM, d = 50, k = 20, subset.row = chosen_hvgs)
mnn_res## class: SingleCellExperiment
## dim: 1958 13999
## metadata(2): merge.info pca.info
## assays(1): reconstructed
## rownames(1958): HES4 ISG15 ... CTD-2521M24.5 AJ006998.2
## rowData names(1): rotation
## colnames(13999): AAACATACATTTCC.1 AAACATACCAGAAA.1 ... TTTGCATGCTAAGC.1
## TTTGCATGGGACGA.1
## colData names(1): batch
## reducedDimNames(1): corrected
## mainExpName: NULL
## altExpNames(0):
The resulting SingleCellExperiment object contains the batch information. We can also retrieve a matrix of our dimension reduction results and corrected log counts with the reducedDim() and assay() functions.
table(mnn_res$batch)##
## CTRL STIM
## 6548 7451
reducedDim(mnn_res, "corrected")[1:2, ]## [,1] [,2] [,3] [,4] [,5]
## AAACATACATTTCC.1 0.4692084 0.002558369 0.24112460 -0.01932947 -0.02919753
## AAACATACCAGAAA.1 0.4849751 -0.221433871 -0.03194567 -0.04457301 0.02085704
## [,6] [,7] [,8] [,9] [,10]
## AAACATACATTTCC.1 -0.0001116517 -0.05219879 0.01560541 0.018456590 -0.02042355
## AAACATACCAGAAA.1 0.0208653688 0.02518230 0.05640069 -0.001590786 0.02042924
## [,11] [,12] [,13] [,14] [,15]
## AAACATACATTTCC.1 0.008021788 0.04396844 0.005131837 0.01141104 -0.0002610285
## AAACATACCAGAAA.1 0.006484710 -0.02690261 -0.014495237 -0.01603656 0.0417677341
## [,16] [,17] [,18] [,19] [,20]
## AAACATACATTTCC.1 0.001044311 0.02583133 -0.004660621 0.003453847 0.007833429
## AAACATACCAGAAA.1 -0.002738321 -0.01539083 0.028006124 -0.010241152 0.016709705
## [,21] [,22] [,23] [,24] [,25]
## AAACATACATTTCC.1 0.004925765 0.01688819 0.01693241 0.003615093 0.007882780
## AAACATACCAGAAA.1 -0.019963264 -0.02561860 -0.01056931 -0.010869554 0.005000097
## [,26] [,27] [,28] [,29]
## AAACATACATTTCC.1 -0.01073247 0.0075159768 0.006422343 -0.004441872
## AAACATACCAGAAA.1 0.02215599 0.0008773935 -0.003697461 -0.014406621
## [,30] [,31] [,32] [,33]
## AAACATACATTTCC.1 0.002957031 0.01006521 -0.002491118 -0.001875366
## AAACATACCAGAAA.1 -0.003685269 -0.01067559 -0.021875400 0.011806193
## [,34] [,35] [,36] [,37]
## AAACATACATTTCC.1 -0.008862147 0.002032289 -0.008468607 -0.007954116
## AAACATACCAGAAA.1 0.003560579 -0.002324063 0.002952769 -0.006034000
## [,38] [,39] [,40] [,41]
## AAACATACATTTCC.1 -0.0009951656 0.002792573 0.002435519 -0.015530207
## AAACATACCAGAAA.1 -0.0009532853 -0.002280508 0.010147310 0.003641573
## [,42] [,43] [,44] [,45]
## AAACATACATTTCC.1 0.0009633955 0.0026935385 0.0002299275 0.003499859
## AAACATACCAGAAA.1 -0.0034695643 -0.0008694543 0.0037271426 0.002181876
## [,46] [,47] [,48] [,49]
## AAACATACATTTCC.1 -0.001978661 -0.009881723 -0.004609779 0.001163945
## AAACATACCAGAAA.1 -0.001995334 -0.005362670 0.001793127 -0.005419268
## [,50]
## AAACATACATTTCC.1 -0.0095770715
## AAACATACCAGAAA.1 0.0003581869
assay(mnn_res, "reconstructed")[1:2, 1:5]## <2 x 5> LowRankMatrix object of type "double":
## AAACATACATTTCC.1 AAACATACCAGAAA.1 AAACATACCTCGCT.1 AAACATACCTGGTA.1
## HES4 -0.001084748 -0.001208982 -0.001217450 0.001368426
## ISG15 -0.112621047 -0.128283733 -0.111386938 -0.138798378
## AAACATACGATGAA.1
## HES4 -0.001118786
## ISG15 -0.129904496
Make UMAP using the scater package.
library(scater)
set.seed(1001)
mnn_res <- runUMAP(mnn_res, dimred = "corrected")
mnn_res$batch <- factor(mnn_res$batch)
plotUMAP(mnn_res, colour_by = "batch")We will cluster using SNN graph approach from scran as this is comptaible with our SingleCellExperiment object.
snn.gr <- buildSNNGraph(mnn_res, use.dimred = "corrected")
cluster_mnn <- igraph::cluster_louvain(snn.gr)$membership
table(cluster_mnn)## cluster_mnn
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## 806 839 1534 418 1072 835 933 848 924 1931 1350 2453 56
Lets check the UMAP for our clustered results.
mnn_res$cluster <- factor(cluster_mnn)
plotUMAP(mnn_res, colour_by = "cluster")We can also check how many cells from each cluster belong to each group. With this approach you can see that there is a lot more group-specific clusters.
tbl <- table(mnn_res$batch, mnn_res$cluster)
barplot(tbl, beside = T, main = "Cell numbers in each cluster of each group")We must annotate our SingleCellExperiment object with different cell type information. We can then visualize the cell types on our UMAP.
cellType <- lapply(sce_list, function(x) {
res <- setNames(as.character(colData(x)$seurat_annotations), colnames(x))
return(res)
})
cell_type <- c(cellType$CTRL, cellType$STIM)
mnn_res$cell_type <- cell_type[match(rownames(colData(mnn_res)), names(cell_type))]
mnn_res$cell_type <- factor(mnn_res$cell_type)plotUMAP(mnn_res, colour_by = "cell_type")We can also use a heatmap to look at the specificity of each cell type to each cluster. Most are cluster-specific.
tbl <- table(mnn_res$cluster, mnn_res$cell_type)
pheatmap(tbl, scale = "column")Performance is dependent on experimental context.
Generally: * RPCA - More homogeneous
* MNN - More heterogeneous
We can prepare for Harmony in much the same way as we prepare for the simple Seurat merge: merge, normalize, scale, PCA and UMAP.
seu_obj <- merge(ifnb_list$CTRL, ifnb_list$STIM)
seu_obj <- data_proc(seu_obj)
seu_obj <- ScaleData(seu_obj)
seu_obj <- RunPCA(seu_obj)
seu_obj <- RunUMAP(seu_obj, reduction = "pca", dims = 1:10, reduction.name = "umap")As you can see we are back with our completely seperate groups.
DimPlot(seu_obj)We can use the RunHarmony() function to implement the Harmony correction.
library(harmony)
seu_obj <- RunHarmony(seu_obj, group.by.vars = "stim", assay.use = "RNA")
seu_obj <- RunUMAP(seu_obj, reduction = "harmony", dims = 1:10, reduction.name = "umap_harmony")
DimPlot(seu_obj, reduction = "umap_harmony")To annotate the Single-cell data sets, we can evaluate the gene expression pattern of well known cell-type specific marker genes and make a manual annotation, like we did in section II. Here, we will introduce two more strategies to make cell type annotations automatically: 1. Mapping and annotating query datasets with Seurat using a reference data set. link 2. Make annotation with SingleR link
The demo data, panc8, was fetched by using SeuratData(). It contains single-cell RNA-Seq of human pancreatic islet sequenced with different techniques.
The demo data contains * CELSeq dataset1 (GSE81076), as reference * CELSeq dataset2 (GSE85241), as reference * SMART-Seq2 (E-MTAB-5061), as reference * Fluidigm C1 (GSE86469), as query
As we did before we will split this dataset. We have also prepped this dataset for you, if you want to load the data object in directly.
library(Seurat)
library(SeuratData)
InstallData("panc8")
data("panc8")Split up the datasets by techniques using SplitObject().
panc8 <- UpdateSeuratObject(panc8)
seu_list <- SplitObject(panc8, split.by = "tech")
names(seu_list) <- c("celseq1", "celseq2", "smartseq", "fluidigmc1", "indrop")load("data/panc8.RData")For this we need a reference dataset and a query dataset. Merge celseq1 and celseq2 data with RPCA to make the reference.
seu_list <- lapply(seu_list, data_proc)
ref_list <- seu_list[c("celseq1", "celseq2")]
feats <- SelectIntegrationFeatures(ref_list)
ref_list <- lapply(ref_list, function(seu, feats) {
seu <- ScaleData(seu, features = feats, verbose = FALSE)
seu <- RunPCA(seu, features = feats, verbose = FALSE)
return(seu)
}, feats)Merge celseq1 and celseq2 data with RPCA
anchors <- FindIntegrationAnchors(ref_list, anchor.features = feats, reduction = "rpca")
ref_panc <- IntegrateData(anchorset = anchors)
ref_panc <- ScaleData(ref_panc)
ref_panc <- quick_clust(ref_panc)How does the data look?
DimPlot(ref_panc, group.by = "seurat_clusters", split.by = "tech", pt.size = 0.2,
label = TRUE)panc_list <- list(ref = ref_panc, query = seu_list$smartseq)
feats <- SelectIntegrationFeatures(panc_list)
panc_list <- lapply(panc_list, function(seu, feats) {
seu <- ScaleData(seu, features = feats, verbose = FALSE)
seu <- RunPCA(seu, features = feats, verbose = FALSE)
return(seu)
}, feats)Identify the anchors between reference and query data sets, using FindTransferAnchors(). These are essential to transfer information from our reference to our query. We can then transfer the cell type information. For each cell in query dataset, the score for each given cell type was estimated by the gene expression pattern of anchor genes using the TransferData() function.
anchors <- FindTransferAnchors(reference = panc_list$ref, query = panc_list$query,
dims = 1:30, reference.reduction = "pca")
pred_res <- TransferData(anchorset = anchors, refdata = panc_list$ref$celltype)
head(pred_res, 2)## predicted.id prediction.score.gamma prediction.score.acinar
## AZ_A2 gamma 0.9913817 0
## AZ_B9 alpha 0.0000000 0
## prediction.score.alpha prediction.score.delta prediction.score.beta
## AZ_A2 0.008618342 0 0
## AZ_B9 1.000000000 0 0
## prediction.score.ductal prediction.score.endothelial
## AZ_A2 0 0
## AZ_B9 0 0
## prediction.score.activated_stellate prediction.score.schwann
## AZ_A2 0 0
## AZ_B9 0 0
## prediction.score.mast prediction.score.macrophage
## AZ_A2 0 0
## AZ_B9 0 0
## prediction.score.epsilon prediction.score.quiescent_stellate
## AZ_A2 0 0
## AZ_B9 0 0
## prediction.score.max
## AZ_A2 0.9913817
## AZ_B9 1.0000000
The cell type with highest score was assigned to the given cell. We can visualize this score with a heatmap.
mat <- as.matrix(pred_res[, -c(1, 15)])
colnames(mat) <- gsub("prediction.score.", "", colnames(mat))
pheatmap::pheatmap(mat, scale = "row", show_rownames = FALSE)pred_cellType <- setNames(pred_res$predicted.id, rownames(pred_res))
panc_list$query[["cellType_predBySeurat"]] <- pred_cellType[match(Cells(panc_list$query),
names(pred_cellType))]
head(panc_list$query, 2)## orig.ident nCount_RNA nFeature_RNA tech replicate assigned_cluster
## AZ_A2 AZ 384801 3611 smartseq2 smartseq2 <NA>
## AZ_B9 AZ 654549 4433 smartseq2 smartseq2 <NA>
## celltype dataset cellType_predBySeurat
## AZ_A2 gamma smartseq2 gamma
## AZ_B9 alpha smartseq2 alpha
table(panc_list$query$cellType_predBySeurat)##
## acinar activated_stellate alpha beta
## 192 55 1028 316
## delta ductal endothelial gamma
## 122 439 21 200
## macrophage mast quiescent_stellate schwann
## 7 7 5 2
SingleR is bioconductor package which can be used to predict annotations of a single-cell dataset.
This package uses SingleCellExperiment objects, so we need to convert our Seurat object.
sce_list <- lapply(panc_list, function(seu) {
sce <- as.SingleCellExperiment(seu, assay = "RNA")
return(sce)
})The score is generated comparing the expression levels of a cell in query dataset and the expression pattern of certain group (eg. cell types) in reference dataset. A cell would be assigned as the cell type which has highest score.
library(SingleR)
pred_res <- SingleR(ref = sce_list$ref, test = sce_list$query, labels = sce_list$ref$celltype)
head(pred_res, 2)## DataFrame with 2 rows and 4 columns
## scores labels delta.next pruned.labels
## <matrix> <character> <numeric> <character>
## AZ_A2 0.300904:0.178568:0.436055:... gamma 0.1394303 NA
## AZ_B9 0.318227:0.227997:0.529730:... alpha 0.0861217 alpha
By converting to a matrix, we can check the cell type scoring using a heatmap.
mat <- as.matrix(pred_res$scores)
rownames(mat) <- rownames(pred_res)
pheatmap::pheatmap(mat, scale = "row", show_rownames = FALSE)Lastly we want to add our annotation back to our query dataset.
cell_type <- setNames(pred_res$pruned.labels, rownames(pred_res))
panc_list$query$cellType_predBySingleR <- cell_type[match(Cells(panc_list$query),
names(cell_type))]
head(panc_list$query, 2)## orig.ident nCount_RNA nFeature_RNA tech replicate assigned_cluster
## AZ_A2 AZ 384801 3611 smartseq2 smartseq2 <NA>
## AZ_B9 AZ 654549 4433 smartseq2 smartseq2 <NA>
## celltype dataset cellType_predBySeurat cellType_predBySingleR
## AZ_A2 gamma smartseq2 gamma <NA>
## AZ_B9 alpha smartseq2 alpha alpha
First we can compare this back to the original annotation. We will look for how many overlap.
Seurat annotation:
table(panc_list$query$cellType_predBySeurat == panc_list$query$celltype)##
## FALSE TRUE
## 41 2353
SingleR annotation:
table(panc_list$query$cellType_predBySingleR == panc_list$query$celltype)##
## FALSE TRUE
## 35 2299
Next we can compare our two annotations:
table(panc_list$query$cellType_predBySeurat == panc_list$query$cellType_predBySingleR)##
## FALSE TRUE
## 37 2297
tbl <- table(panc_list$query$cellType_predBySeurat, panc_list$query$cellType_predBySingleR)
pheatmap::pheatmap(tbl, scale = "row")This isn’t always the case. This is good demonstration data.
sce <- readRDS("data/sceOBJ_Nestorowa_HSC.rds")
sce## class: SingleCellExperiment
## dim: 46078 1656
## metadata(0):
## assays(2): counts logcounts
## rownames(46078): ENSMUSG00000000001 ENSMUSG00000000003 ...
## ENSMUSG00000107391 ENSMUSG00000107392
## rowData names(3): GENEID SYMBOL SEQNAME
## colnames(1656): HSPC_025 HSPC_031 ... Prog_852 Prog_810
## colData names(5): cell.type FACS sizeFactor label cellType
## reducedDimNames(3): diffusion PCA UMAP
## mainExpName: endogenous
## altExpNames(1): ERCC
To find the trajectories we need to fit our high dimensional data onto a curve. This step is processed with the slingshot link package.
To run slingshot we need: * Our sce object: sce * Clusters: sce$label * Our choice of dimension reduction approach: PCA * To reduce computational load the adjacent 100 cells are aggregated into single point (approx_points=100) * To avoid connecting unrelated trajectories, OMEGA clustering is introduced (omega=TRUE).
library(slingshot)
sce.sling <- slingshot(sce, cluster = sce$label, reducedDim = "PCA", approx_points = 100,
omega = TRUE)colData(sce.sling)[1, 1:5]## DataFrame with 1 row and 5 columns
## cell.type FACS sizeFactor label
## <lgCMatrix> <matrix> <numeric> <factor>
## HSPC_025 FALSE:FALSE:TRUE:... 27675.2:1.17991:1.12999:... 0.499986 3
## cellType
## <character>
## HSPC_025 Erythrocytes
Slingshot has fitted principle curves. We now will embed these cruves in UMAP space embedCurves(). We can see the number o lineages detected at this point.
embedded <- embedCurves(sce.sling, "UMAP")
embedded@metadata$lineages## $Lineage1
## [1] "8" "5" "9" "1" "2" "3" "4"
##
## $Lineage2
## [1] "8" "5" "9" "1" "2" "6"
##
## $Lineage3
## [1] "7"
Once we have lineages, we can then find the position of each cell along these trajectories. This is estimating the pseudotime.
pseudo.paths <- slingPseudotime(sce.sling)
head(pseudo.paths, 2)## Lineage1 Lineage2 Lineage3
## HSPC_025 111.82993 NA NA
## HSPC_031 96.15939 99.77961 NA
It will be useful to combine these times, to find the shared time.
avg_pseudoTime <- rowMeans(pseudo.paths, na.rm = TRUE)
colData(sce.sling)$avg_pseudoTime <- avg_pseudoTimeOnce we have the average pseudotime we can always project these values on to our UMAP.
plotUMAP(sce.sling, colour_by = "avg_pseudoTime")The slingCurves() function can fetch essential information of each principle curve. The results are inside a list, with each principle curve in an element. For each principle curve, we can extract the UMAP data and the order of cells for plotting.
embedded_curve <- slingCurves(embedded)
curve <- lapply(embedded_curve, function(x) {
dat <- data.frame(x$s[x$ord, ]) # UMAP axis and the order of cells
return(dat)
})
names(curve) <- c("curve1", "curve2", "curve3")
head(curve$curve1)## UMAP1 UMAP2
## 1 -12.23250 0.6437628
## 2 -12.01319 0.7200463
## 3 -11.79388 0.7963086
## 4 -11.57455 0.8725464
## 5 -11.35521 0.9487473
## 6 -11.13585 1.0248888
We can now just add our curve information directly to our pseudotime UMAP plot using ggplot parameters geom_path().
plotUMAP(sce.sling, colour_by = "avg_pseudoTime") + geom_path(data = curve$curve1,
aes(x = UMAP1, y = UMAP2), color = "blue") + geom_path(data = curve$curve2, aes(x = UMAP1,
y = UMAP2), color = "black") + geom_path(data = curve$curve3, aes(x = UMAP1,
y = UMAP2), color = "red")The results indicated that lineage 3 is an outlier. If we look back at our lineage information this contains cluster 7.
embedded@metadata$lineages## $Lineage1
## [1] "8" "5" "9" "1" "2" "3" "4"
##
## $Lineage2
## [1] "8" "5" "9" "1" "2" "6"
##
## $Lineage3
## [1] "7"
We can remove cluster 7 and re-evaluate trajectories again. First we rerun the pseudotime estimation.
sce2 <- sce[, colData(sce)$label != 7]
sce.sling2 <- slingshot(sce2, cluster = sce2$label, reducedDim = "PCA", approx_points = 100,
omega = TRUE)
pseudo.paths <- slingPseudotime(sce.sling2)
avg_pseudoTime <- rowMeans(pseudo.paths, na.rm = TRUE)
colData(sce.sling2)$avg_pseudoTime <- avg_pseudoTimeNext we extract out the princple curves again.
embedded <- embedCurves(sce.sling2, "UMAP")
embedded_curve <- slingCurves(embedded)
curve <- lapply(embedded_curve, function(x) {
dat <- data.frame(x$s[x$ord, ]) # UMAP axis and the order of cells
return(dat)
})
names(curve) <- c("curve1", "curve2")plotUMAP(sce.sling2, colour_by = "avg_pseudoTime") + geom_path(data = curve$curve1,
aes(x = UMAP1, y = UMAP2), color = "blue") + geom_path(data = curve$curve2, aes(x = UMAP1,
y = UMAP2), color = "black")So far we have visualized the lineages together. We will want to break this down and compare to clusters. First we will look at clusters.
plotUMAP(sce.sling2, colour_by = "label")We can reveal which cluster belong to each cluster and lineage 1’s trajectory.
embedded@metadata$lineages$Lineage1## [1] "8" "5" "9" "1" "2" "3" "4"
plotUMAP(sce.sling2, colour_by = "slingPseudotime_1") + geom_path(data = curve$curve1,
aes(x = UMAP1, y = UMAP2))We can reveal which cluster belong to each cluster and lineage 2’s trajectory.
embedded@metadata$lineages$Lineage2## [1] "8" "5" "9" "1" "2" "6"
plotUMAP(sce.sling2, colour_by = "slingPseudotime_2") + geom_path(data = curve$curve2,
aes(x = UMAP1, y = UMAP2))Driver genes were the genes whose expression level changes reflect the pseudotime changes. We can find these using testPseudotime() from TSCAN. Using this the gene expression and pseudotime are fitted with a non-linear model, followed by testing with ANOVA.
We will start by just looking at lineage 1.
library(TSCAN)
res <- testPseudotime(sce.sling2, sce.sling2$slingPseudotime_1)We can see the results. Often we will want to reorder it too prioritize the biggest changers.
res$SYMBOL <- rownames(res)
res <- res[order(-res$logFC), ]
head(res)## DataFrame with 6 rows and 4 columns
## logFC p.value FDR SYMBOL
## <numeric> <numeric> <numeric> <character>
## ENSMUSG00000021728 0.0678996 0.00000e+00 0.00000e+00 ENSMUSG00000021728
## ENSMUSG00000016494 0.0654681 0.00000e+00 0.00000e+00 ENSMUSG00000016494
## ENSMUSG00000040747 0.0639249 7.23447e-296 5.16939e-293 ENSMUSG00000040747
## ENSMUSG00000057729 0.0619007 0.00000e+00 0.00000e+00 ENSMUSG00000057729
## ENSMUSG00000021998 0.0610830 0.00000e+00 0.00000e+00 ENSMUSG00000021998
## ENSMUSG00000090164 0.0608169 4.81695e-237 1.76510e-234 ENSMUSG00000090164
We can then categorize our data based on significance and FC.
res$stat <- NA
res$stat[res$logFC > 0 & res$FDR < 0.05] <- "late"
res$stat[res$logFC < 0 & res$FDR < 0.05] <- "early"
res$stat[is.na(res$stat)] <- "nc"
res[1:2, ]## DataFrame with 2 rows and 5 columns
## logFC p.value FDR SYMBOL stat
## <numeric> <numeric> <numeric> <character> <character>
## ENSMUSG00000021728 0.0678996 0 0 ENSMUSG00000021728 late
## ENSMUSG00000016494 0.0654681 0 0 ENSMUSG00000016494 late
We can quickly and easily see how many are upregulated/downregulated across the trajectory.
table(res$stat)##
## early late nc
## 11261 8848 25969
Using the order we can easily isolate the driver genes. Here we can see the top 5 early genes.
sub <- res[res$stat == "early", ]
sub <- sub[order(sub$logFC), ]
top <- head(sub$SYMBOL, 5)
top## [1] "ENSMUSG00000054191" "ENSMUSG00000004655" "ENSMUSG00000027556"
## [4] "ENSMUSG00000031877" "ENSMUSG00000031162"
With plotExpression() we can look at the expression of specific genes i.e. our top genes. We can see how each gene is expressed across all cells and psuedotime.
plotExpression(sce.sling2, features = top, swap_rownames = "GENEID", x = "slingPseudotime_1",
colour_by = "label")We can now repeat this for the top 5 late genes.
sub <- res[res$stat == "late", ]
sub <- sub[order(-sub$logFC), ]
top <- head(sub$SYMBOL, 5)
plotExpression(sce.sling2, features = top, swap_rownames = "GENEID", x = "slingPseudotime_1",
colour_by = "label")We can now repeat this for the top 5 late genes.
Now lets repeat this whole process for lineage 2.
res <- testPseudotime(sce.sling2, sce.sling2$slingPseudotime_2)
res$SYMBOL <- rownames(res)
res <- res[order(-res$logFC), ]
res$stat <- NA
res$stat[res$logFC > 0 & res$FDR < 0.05] <- "late"
res$stat[res$logFC < 0 & res$FDR < 0.05] <- "early"
res$stat[is.na(res$stat)] <- "nc"
table(res$stat)##
## early late nc
## 11566 8450 26062
sub <- res[res$stat == "early", ]
sub <- sub[order(sub$logFC), ]
top <- head(sub$SYMBOL, 5)
plotExpression(sce.sling2, features = top, swap_rownames = "GENEID", x = "slingPseudotime_1",
colour_by = "label")sub <- res[res$stat == "late", ]
sub <- sub[order(-sub$logFC), ]
top <- head(sub$SYMBOL, 5)
plotExpression(sce.sling2, features = top, swap_rownames = "GENEID", x = "slingPseudotime_1",
colour_by = "label")The CITE-Seq method labels different types of cells or different samples with hashtag-labeled antibodies. Then, the cells can be pooled together into a single library and be sequencing at the same time. After sequencing, the cells can be separated by the different antibody hashtags.
rna.mat <- readRDS("data/pbmc_umi_mtx.rds")
dim(rna.mat)## [1] 27117 16916
hto.mat <- readRDS("data/pbmc_hto_mtx.rds")
dim(hto.mat)## [1] 8 16916
rownames(hto.mat)## [1] "HTO_A" "HTO_B" "HTO_C" "HTO_D" "HTO_E" "HTO_F" "HTO_G" "HTO_H"
seu_obj <- CreateSeuratObject(counts = rna.mat, project = "citeSeq_demo")
seu_obj[["HTO"]] <- CreateAssayObject(counts = hto.mat)
seu_obj## An object of class Seurat
## 27125 features across 16916 samples within 2 assays
## Active assay: RNA (27117 features, 0 variable features)
## 1 layer present: counts
## 1 other assay present: HTO
DefaultAssay(seu_obj) <- "RNA"
seu_obj <- data_proc(seu_obj)
seu_obj <- ScaleData(seu_obj)seu_obj <- quick_clust(seu_obj)
DimPlot(seu_obj, group.by = "seurat_clusters", pt.size = 0.2, label = TRUE) + NoLegend()DefaultAssay(seu_obj) <- "HTO"
seu_obj <- NormalizeData(seu_obj, assay = "HTO", normalization.method = "CLR")RidgePlot(seu_obj, features = c("HTO-A", "HTO-B"), group.by = "hash.ID") + NoLegend()#
table(seu_obj$HTO_classification.global)##
## Doublet Negative Singlet
## 2598 346 13972
#
table(seu_obj$hash.ID)##
## Doublet HTO-H HTO-D HTO-E HTO-G HTO-F HTO-B HTO-C
## 2598 1808 1716 1487 1660 1520 1993 1873
## HTO-A Negative
## 1915 346
#
table(seu_obj$HTO_classification.global, seu_obj$hash.ID)##
## Doublet HTO-H HTO-D HTO-E HTO-G HTO-F HTO-B HTO-C HTO-A Negative
## Doublet 2598 0 0 0 0 0 0 0 0 0
## Negative 0 0 0 0 0 0 0 0 0 346
## Singlet 0 1808 1716 1487 1660 1520 1993 1873 1915 0
Single-cell quantification of a broad RNA spectrum reveals uniqure noncoding patterns associated with cell types and states. Isaknova A, Neff N, and Quake SR, PNAS (2021). link
Counting matrix of mouse embryo data was download from GEO GSE151334.
Data process workflow was recorded on GitHub. link
The SMART-Seq data pre-processing is quite similar to bulk RNA-seq, including align to reference genome and counting. There are two strategy to process from FQ files.
Details of the two methods can be found in the BRC training course for bulk RNAseq. Please refer to this link.
When considering whether to do pseudo-alignment or alignment, we do have to consider to pros and cons of these approaches. Pseudoaligner is much faster, but alignment gives you an exact location of the read. This can be important if you suspect RNA degradation or similar QC issues as you can check the 5’-3’ bias of the reads. This is not possible with 10X.
mtx <- read.delim("data/GSE151334_counts.mouse.tsv.gz")
mtx[1:2, 1:5]## mESC_EB_d0_A11_S11 mESC_EB_d0_A12_S12 mESC_EB_d0_A13_S13
## 0610005C13Rik 0 0 0
## 0610006L08Rik 0 0 0
## mESC_EB_d0_A14_S14 mESC_EB_d0_A15_S15
## 0610005C13Rik 0 0
## 0610006L08Rik 0 0
Once we have loaded the matrix, we can create a Seurat object. After this point we will be following a similar workflow to the 10X data. Here we start by finidng the mitochondrial content.
library(Seurat)
seu <- CreateSeuratObject(mtx, project = "GSE151334", min.cells = 3, min.features = 300)
seu[["percent.mt"]] <- PercentageFeatureSet(seu, pattern = "^mt-")
seu <- SCTransform(seu, variable.features.n = 2000)library(reticulate)
reticulate::py_install("scrublet")
scr <- reticulate::import("scrublet")
mat <- GetAssayData(seu, assay = "RNA", slot = "counts")
mat <- as.matrix(mat)
scr <- reticulate::import("scrublet")
scrub <- scr$Scrublet(t(mat))
doublet <- scrub$scrub_doublets()
names(doublet) <- c("doublet_score", "doublet")
seu[["doublet_score"]] <- doublet$doublet_score
seu[["doublet"]] <- doublet$doubletLets read in annotated cell cycle genes from the Regev lab. This list of genes was curated by the authors of Seurat, and is a subset of the genes used in the original Cell Cycle paper.
set.seed(42)
seu <- RunPCA(seu, npcs = 50, verbose = FALSE)
seu <- FindNeighbors(seu, dims = 1:15, reduction = "pca")
seu <- RunUMAP(seu, dims = 1:15, reduction = "pca")
seu <- FindClusters(seu, resolution = 0.5)We can see here that this is a highly variable dataset. We can also see that there are some cells with high mitochondrial content in some samples. We will filter these out later.
library(ggplot2)
VlnPlot(seu, features = c("nCount_RNA", "nFeature_RNA", "percent.mt", "doublet_score",
"S.Score", "G2M.Score"), group.by = "seurat_clusters")FeatureScatter(seu, feature1 = "nCount_RNA", feature2 = "nFeature_RNA", group.by = "seurat_clusters")Another view of our high mitochondrial content.
FeatureScatter(seu, feature1 = "nCount_RNA", feature2 = "percent.mt", group.by = "seurat_clusters") +
geom_hline(yintercept = 10, color = "black", linetype = "dashed")table(seu$doublet == "TRUE" | seu$percent.mt > 10)##
## FALSE TRUE
## 1559 373
set.seed(42)
seu <- subset(seu, subset = percent.mt <= 10 & doublet == "FALSE")
seu <- SCTransform(seu, variable.features.n = 2000, vars.to.regress = c("percent.mt"))
seu <- RunPCA(seu, npcs = 50, verbose = FALSE)
seu <- RunUMAP(seu, dims = 1:15, reduction = "pca")
seu <- FindNeighbors(seu, dims = 1:15, reduction = "pca")
seu <- FindClusters(seu, resolution = 0.5)## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 1559
## Number of edges: 42165
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8941
## Number of communities: 10
## Elapsed time: 0 seconds
DimPlot(seu, group.by = "seurat_clusters", pt.size = 0.2, label = TRUE) + NoLegend()markers <- FindAllMarkers(seu, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
library(dplyr)
top_genes <- markers %>%
group_by(cluster) %>%
slice_max(n = 5, order_by = avg_log2FC)DoHeatmap(seu, features = c(top_genes$gene))VlnPlot(seu, features = c("Nanog", "Pou5f1"), ncol = 2, pt.size = 0)VlnPlot(seu, features = c("Pax6", "Stau2"), ncol = 2, pt.size = 0)VlnPlot(seu, features = c("Epcam", "Gata6"), ncol = 2, pt.size = 0)VlnPlot(seu, features = c("Dlk1", "Postn"), ncol = 2, pt.size = 0)seu[["layers"]] <- NA
seu$layers[seu$seurat_clusters %in% c(0, 4, 5)] <- "primary_ES"
seu$layers[seu$seurat_clusters %in% c(2, 3, 10, 11, 12)] <- "ectoderm"
seu$layers[seu$seurat_clusters %in% c(1, 9)] <- "mesoderm"
seu$layers[seu$seurat_clusters %in% c(6, 7, 8)] <- "endoderm"DimPlot(seu, group.by = "layers", pt.size = 0.2)